This short report details the standard Random Forest in the Manus Models.py code. By default, the code performs a train/test split on trainingData.csv; I do not know the provenance of this file.
```python
traing_data = self.__read_local_file_into_dataframe("Analytics/trainingData.csv")
X = traing_data[self.__features]
y = traing_data['Diagnosis']
self.target_population = "PD vs Not PD"
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=392)
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(x_train, y_train)
```
The trained Random Forest classifier produced here should therefore be identical to the one in the main code.
The test split was not used by the main code, so I used it as a 'blind' test set to assess performance.
I also dump out all of the trees in the forest.
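Dumping the trees can be done with scikit-learn's `export_text` (or `export_graphviz`). A minimal sketch, using synthetic stand-in data since trainingData.csv is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data; the report's model is fitted on trainingData.csv.
X, y = make_classification(n_samples=60, n_features=5, random_state=392)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# model.estimators_ holds the 100 individual DecisionTreeClassifiers;
# export_text renders each one as a readable set of split rules.
dumps = [export_text(tree) for tree in model.estimators_]
print(dumps[0])
```

Each entry in `dumps` can then be written out to its own file for inspection.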
Important
The value of this blind analysis depends on the provenance of Analytics/trainingData.csv.
The initial backend commit is June 2021. This came from the original DiPar work, but the current training file appears to include subjects from the Walker study, so any test against 76patients_21HC is (potentially) biased. I can't find the original model files to load the model derived from DiPar data alone (the comment in the code just says it should be in the same folder; it isn't).
The actual training files don't have subject IDs against them, so it is hard to be definitive about the independence of the data, but the blind plots do show an encouraging, more 'real-world' spread in probabilities.
dataset_details109.xlsx does have subject labels, but the numbers don't match trainingData.csv. trainingData.csv has 132 unlabelled FE data entries of 123 values each. trainingData.csv comes from one of Martin's first commits, dated 25 Jun 2021: "#4 - Initial commit of the backend functions".
The Models.py code actually does a 70:30 train:test split on the data but then ignores the test set. If you add an evaluation on it, you get a 'blind' test accuracy of 0.8 and ROC AUC of 0.875 (as expected, the train accuracy and ROC AUC are both perfect, i.e. 1.0).
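The blind evaluation follows this pattern (a sketch with synthetic data standing in for trainingData.csv, so the numbers here will differ from those reported above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PD-vs-not-PD training data (132 rows, as in trainingData.csv).
X, y = make_classification(n_samples=132, n_features=10, random_state=392)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=392)

model = RandomForestClassifier(n_estimators=100, random_state=1).fit(x_train, y_train)

# Train metrics are typically (near-)perfect for an unpruned forest...
train_acc = accuracy_score(y_train, model.predict(x_train))
# ...while the held-out 30% gives the 'blind' estimate.
test_acc = accuracy_score(y_test, model.predict(x_test))
test_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:, 1])
print(train_acc, test_acc, test_auc)
```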
The default ensemble of 100 trees is used in Models.py. The model's prediction for a new sample is the average of the predictions from all the trees, which makes the forest more robust to individual data issues and dropouts than any single tree.
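The averaging can be checked directly: in scikit-learn, the forest's `predict_proba` equals the mean of the per-tree probabilities (sketch with synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Forest probability = mean over the 100 individual trees' probabilities.
per_tree = np.stack([t.predict_proba(X[:5]) for t in model.estimators_])
averaged = per_tree.mean(axis=0)
assert np.allclose(averaged, model.predict_proba(X[:5]))
```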